{
 "nbformat": 4,
 "nbformat_minor": 0,
 "metadata": {
  "colab": {
   "provenance": [],
   "authorship_tag": "ABX9TyPtHa1bpv1qlH9QN6TKgN33",
   "include_colab_link": true
  },
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3"
  },
  "language_info": {
   "name": "python"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "1571259796a64a398b942576a899ef8a": {
     "model_module": "@jupyter-widgets/controls",
     "model_name": "DropdownModel",
     "model_module_version": "1.5.0",
     "state": {
      "_dom_classes": [],
      "_model_module": "@jupyter-widgets/controls",
      "_model_module_version": "1.5.0",
      "_model_name": "DropdownModel",
      "_options_labels": [
       "Action",
       "Adventure",
       "Biography",
       "Comedy",
       "Crime",
       "Documentary",
       "Drama",
       "Fantasy",
       "History",
       "Horror",
       "Music",
       "Musical",
       "Mystery",
       "Romance",
       "Sci-Fi",
       "Sport",
       "Thriller",
       "War",
       "Western"
      ],
      "_view_count": null,
      "_view_module": "@jupyter-widgets/controls",
      "_view_module_version": "1.5.0",
      "_view_name": "DropdownView",
      "description": "Choose a genre:",
      "description_tooltip": null,
      "disabled": false,
      "index": 3,
      "layout": "IPY_MODEL_ef9065ee3d1741f594eb8dc97f9f3d07",
      "style": "IPY_MODEL_7f739e3ffaa24b2abde6d6d0b52ab003"
     }
    },
    "ef9065ee3d1741f594eb8dc97f9f3d07": {
     "model_module": "@jupyter-widgets/base",
     "model_name": "LayoutModel",
     "model_module_version": "1.2.0",
     "state": {
      "_model_module": "@jupyter-widgets/base",
      "_model_module_version": "1.2.0",
      "_model_name": "LayoutModel",
      "_view_count": null,
      "_view_module": "@jupyter-widgets/base",
      "_view_module_version": "1.2.0",
      "_view_name": "LayoutView",
      "align_content": null,
      "align_items": null,
      "align_self": null,
      "border": null,
      "bottom": null,
      "display": null,
      "flex": null,
      "flex_flow": null,
      "grid_area": null,
      "grid_auto_columns": null,
      "grid_auto_flow": null,
      "grid_auto_rows": null,
      "grid_column": null,
      "grid_gap": null,
      "grid_row": null,
      "grid_template_areas": null,
      "grid_template_columns": null,
      "grid_template_rows": null,
      "height": null,
      "justify_content": null,
      "justify_items": null,
      "left": null,
      "margin": null,
      "max_height": null,
      "max_width": null,
      "min_height": null,
      "min_width": null,
      "object_fit": null,
      "object_position": null,
      "order": null,
      "overflow": null,
      "overflow_x": null,
      "overflow_y": null,
      "padding": null,
      "right": null,
      "top": null,
      "visibility": null,
      "width": null
     }
    },
    "7f739e3ffaa24b2abde6d6d0b52ab003": {
     "model_module": "@jupyter-widgets/controls",
     "model_name": "DescriptionStyleModel",
     "model_module_version": "1.5.0",
     "state": {
      "_model_module": "@jupyter-widgets/controls",
      "_model_module_version": "1.5.0",
      "_model_name": "DescriptionStyleModel",
      "_view_count": null,
      "_view_module": "@jupyter-widgets/base",
      "_view_module_version": "1.2.0",
      "_view_name": "StyleView",
      "description_width": ""
     }
    }
   }
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "view-in-github",
    "colab_type": "text"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/langroid/langroid/blob/main/examples/docqa/langroid-lancedb-rag-movies.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "# Retrieval-Augmented Analytics with Langroid + LanceDB\n",
    "\n",
    "\n",
    "Say you are working with a large dataset of movie ratings. Let's think about\n",
    "how to answer questions like this:\n",
    "> What is the highest rated Comedy movie about college students made after 2010?\n",
    "\n",
    "To answer this kind of question, we need:\n",
    "- filtering (on genre, year),\n",
    "- retrieval (semantic/lexical search on 'college students'),\n",
    "- computation (highest rated), and\n",
    "- LLM-based generation of the final answer.\n",
    "\n",
    "Of course, we'd like to automate the filtering and computation steps -- but how?\n",
    "\n",
    "\n",
    "We could use an LLM to generate a **Query Plan** for this --\n",
    "provided the underlying data store supports:\n",
    "- a filtering language \"known\" to LLMs (like SQL), and\n",
    "- a computation language \"known\" to LLMs (like a Pandas dataframe expression).\n",
    "\n",
    "This is where [LanceDB](https://github.com/lancedb/lancedb) (the default vector-db in Langroid) comes in:\n",
    "it's a versatile, highly performant, serverless vector-database that\n",
    "supports all of these functions within the same storage system and API:\n",
    "- Fast Full-text search (so you can do lexical search in the same store\n",
    "  where you do vector/semantic-search)\n",
    "- SQL-like metadata filtering\n",
    "- Pandas dataframe interop, so you can ingest dataframes and do pandas computations.\n",
    "**bold text**\n",
    "Leveraging Langroid's powerful Multi-Agent and tools orchestration, we built a\n",
    "3-Agent system consisting of:\n",
    "- Query Planner: Takes a user's query (like the above) and generates a Query Plan as a tool/function\n",
    "  consisting of: (a) a SQL-like filter, (b) a possibly rephrased query, and (c) an optional Pandas computation.\n",
    "- A RAG Agent (powered by LanceDB) that executes the query plan combining\n",
    "  filtering, RAG, lexical search, and optional Pandas computation.\n",
    "- A Query Plan Critic that examines the Query Plan and the RAG response, and\n",
    "  suggests improvements to the Query Planner, if any.\n",
    "\n",
    "This system can answer questions such as the above.\n",
    "You can try it out in this notebook, with a dataset of\n",
    "IMDB movie ratings.\n",
    "\n",
    "If you want to run it as a script, see here:\n",
    "https://github.com/langroid/langroid-examples/blob/main/examples/docqa/lance-rag-movies.py\n",
    "\n"
   ],
   "metadata": {
    "id": "b9fHPojfnbPy"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "#### Install, setup, import"
   ],
   "metadata": {
    "id": "psOMvEL0Gekz"
   }
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "id": "A8-Y_YPZutn6",
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "outputId": "ae2c9f85-c790-4c0f-80fc-cabd56f8a917"
   },
   "source": [
    "# Silently install, suppress all output (~2-4 mins)\n",
    "!pip install -q --upgrade langroid &> /dev/null\n",
    "!pip show langroid"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": [
    "# various unfortunate things that need to be done to\n",
    "# control colab notebook behavior.\n",
    "\n",
    "# (a) output width\n",
    "\n",
    "from IPython.display import HTML, display\n",
    "\n",
    "def set_css():\n",
    "  display(HTML('''\n",
    "  <style>\n",
    "    pre {\n",
    "        white-space: pre-wrap;\n",
    "    }\n",
    "  </style>\n",
    "  '''))\n",
    "get_ipython().events.register('pre_run_cell', set_css)\n",
    "\n",
    "# (b) logging related\n",
    "import logging\n",
    "logging.basicConfig(level=logging.ERROR)\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "import logging\n",
    "for logger_name in logging.root.manager.loggerDict:\n",
    "    logger = logging.getLogger(logger_name)\n",
    "    logger.setLevel(logging.ERROR)\n",
    "\n",
    "# (c) allow async ops in colab\n",
    "!pip install nest-asyncio\n",
    "import nest_asyncio\n",
    "nest_asyncio.apply()\n"
   ],
   "metadata": {
    "id": "rWwH6duUzAC6",
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "outputId": "3947fda6-7de1-418e-eeb2-7717ef374c27"
   },
   "execution_count": 2,
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": [
    "import pandas as pd\n",
    "from langroid.agent.special.doc_chat_agent import DocChatAgentConfig\n",
    "from langroid.agent.special.lance_doc_chat_agent import LanceDocChatAgent\n",
    "from langroid.agent.special.lance_rag.lance_rag_task import LanceRAGTaskCreator\n",
    "\n",
    "from langroid.utils.configuration import settings\n",
    "from langroid.embedding_models.models import OpenAIEmbeddingsConfig\n",
    "from langroid.vector_store.lancedb import LanceDBConfig\n",
    "settings.cache_type = \"fakeredis\"\n",
    "settings.notebook = True"
   ],
   "metadata": {
    "id": "A5N0NQwc3jX_",
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 17
    },
    "outputId": "311978cb-f35e-40db-d05b-a475006db2ae"
   },
   "execution_count": 22,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "source": [
    "#### OpenAI API Key (Needs GPT4-TURBO)"
   ],
   "metadata": {
    "id": "j-6vNfKW9J7b"
   }
  },
  {
   "cell_type": "code",
   "source": [
    "# OpenAI API Key: Enter your key in the dialog box that will show up below\n",
    "# NOTE: colab often struggles with showing this input box,\n",
    "# if so, try re-running the above cell and then this one,\n",
    "# or simply insert your API key in this cell, though it's not ideal.\n",
    "\n",
    "import os\n",
    "\n",
    "from getpass import getpass\n",
    "\n",
    "os.environ['OPENAI_API_KEY'] = getpass('Enter your GPT4-Turbo-capable OPENAI_API_KEY key:', stream=None)\n",
    "\n",
    "\n"
   ],
   "metadata": {
    "id": "uvTODlZv3yyT",
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 34
    },
    "outputId": "7b9e7857-d030-4175-a2e3-551a5d807611"
   },
   "execution_count": 4,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "source": [
    "#### Get IMDB ratings & descriptions data"
   ],
   "metadata": {
    "id": "TNsZdOjmQdgx"
   }
  },
  {
   "cell_type": "code",
   "source": [
    "# (1) Get the movies dataset\n",
    "\n",
    "import requests\n",
    "file_url = \"https://raw.githubusercontent.com/langroid/langroid-examples/main/examples/docqa/data/movies/IMDB.csv\"\n",
    "response = requests.get(file_url)\n",
    "with open('movies.csv', 'wb') as file:\n",
    "    file.write(response.content)\n",
    "\n"
   ],
   "metadata": {
    "id": "fegAio3kpgoo",
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 17
    },
    "outputId": "140daf33-e39d-403e-c5f1-f8225fe2ad10"
   },
   "execution_count": 5,
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": [
    " print(\n",
    "        \"\"\"\n",
    "        Welcome to the IMDB Movies chatbot!\n",
    "        This dataset has around 130,000 movie reviews, with these columns:\n",
    "\n",
    "        movie, genre, runtime, certificate, rating, stars,\n",
    "        description, votes, director.\n",
    "\n",
    "        To keep things speedy, we'll restrict the dataset to movies\n",
    "        of a specific genre that you can choose.\n",
    "        \"\"\"\n",
    "    )"
   ],
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 191
    },
    "id": "J_Mv32fpOhgH",
    "outputId": "305849bf-0eb4-45c3-bce7-6839462c793c"
   },
   "execution_count": 6,
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": [
    "from ipywidgets import Dropdown\n",
    "genres = [\n",
    "          \"Action\",\n",
    "          \"Adventure\",\n",
    "          \"Biography\",\n",
    "          \"Comedy\",\n",
    "          \"Crime\",\n",
    "          \"Documentary\",\n",
    "          \"Drama\",\n",
    "          \"Fantasy\",\n",
    "          \"History\",\n",
    "          \"Horror\",\n",
    "          \"Music\",\n",
    "          \"Musical\",\n",
    "          \"Mystery\",\n",
    "          \"Romance\",\n",
    "          \"Sci-Fi\",\n",
    "          \"Sport\",\n",
    "          \"Thriller\",\n",
    "          \"War\",\n",
    "          \"Western\",\n",
    "      ]\n",
    "dropdown = Dropdown(options=genres, value=genres[0], description=\"Choose a genre:\", disabled=False)\n",
    "display(dropdown)\n",
    "genre = dropdown.value"
   ],
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 49,
     "referenced_widgets": [
      "1571259796a64a398b942576a899ef8a",
      "ef9065ee3d1741f594eb8dc97f9f3d07",
      "7f739e3ffaa24b2abde6d6d0b52ab003"
     ]
    },
    "id": "eRNHcFHALi67",
    "outputId": "8e0936cd-7522-438c-afda-39717affa513"
   },
   "execution_count": 7,
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": [
    "# READ IN AND CLEAN THE DATA\n",
    "import pandas as pd\n",
    "df = pd.read_csv(\"movies.csv\")\n",
    "\n",
    "def clean_votes(value):\n",
    "    \"\"\"Clean the votes column\"\"\"\n",
    "    # Remove commas and convert to integer, if fails return 0\n",
    "    try:\n",
    "        return int(value.replace(\",\", \"\"))\n",
    "    except ValueError:\n",
    "        return 0\n",
    "\n",
    "# Clean the 'votes' column\n",
    "df[\"votes\"] = df[\"votes\"].fillna(\"0\").apply(clean_votes)\n",
    "\n",
    "# Clean the 'rating' column\n",
    "df[\"rating\"] = df[\"rating\"].fillna(0.0).astype(float)\n",
    "\n",
    "# Replace missing values in all other columns with '??'\n",
    "df.fillna(\"??\", inplace=True)\n",
    "df[\"description\"].replace(\"\", \"unknown\", inplace=True)\n",
    "\n",
    "# get the rows with selected genre\n",
    "df = df[df[\"genre\"].str.contains(genre)]\n",
    "\n",
    "print(\n",
    "    f\"\"\"\n",
    "[blue]There are {df.shape[0]} movies in {genre} genre, hang on while I load them...\n",
    "\"\"\"\n",
    ")\n",
    "# sample 1000 rows for faster testing\n",
    "df = df.sample(1000)"
   ],
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 69
    },
    "id": "oBpuyfowOE7a",
    "outputId": "273e7074-2588-482d-b006-d9796f270d7c"
   },
   "execution_count": 8,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "source": [
    "#### Set up LanceDB Vector-DB and LanceDocChatAgent"
   ],
   "metadata": {
    "id": "5rUPu_WVQprG"
   }
  },
  {
   "cell_type": "code",
   "source": [
    "# Config LanceDB vector database\n",
    "import shutil\n",
    "db_dir = \".lancedb/data\"\n",
    "shutil.rmtree(db_dir)\n",
    "ldb_cfg = LanceDBConfig(\n",
    "    collection_name=\"chat-lance-imdb\",\n",
    "    replace_collection=True,\n",
    "    storage_path=db_dir,\n",
    "    embedding=OpenAIEmbeddingsConfig()\n",
    ")"
   ],
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 17
    },
    "id": "hPDrNYNzLJMt",
    "outputId": "6cd7f681-cedc-450e-baa8-5aab3fdd1389"
   },
   "execution_count": 17,
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": [
    "# configure, create LanceDocChatAgent\n",
    "cfg = DocChatAgentConfig(\n",
    "        vecdb=ldb_cfg,\n",
    "        show_stats=False,\n",
    "        add_fields_to_content=[\"movie\", \"genre\", \"certificate\", \"stars\", \"rating\"],\n",
    "        filter_fields=[\"genre\", \"certificate\", \"rating\"],\n",
    "    )\n",
    "agent = LanceDocChatAgent(cfg)\n"
   ],
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 52
    },
    "id": "SXpnQCV4MF4T",
    "outputId": "82040764-a60c-418d-cfaa-7a1711a9ac14"
   },
   "execution_count": 18,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "source": [
    "#### Ingest data into LanceDocChatAgent"
   ],
   "metadata": {
    "id": "gArmf8GhQxC-"
   }
  },
  {
   "cell_type": "code",
   "source": [
    "# Ingest the data into LanceDocChatAgent\n",
    "agent.ingest_dataframe(df, content=\"description\", metadata=[])\n",
    "df_description = agent.df_description\n",
    "\n",
    "# inform user about the df_description, in blue\n",
    "print(\n",
    "    f\"\"\"\n",
    "Here's a description of the DataFrame that was ingested:\n",
    "{df_description}\n",
    "\"\"\"\n",
    ")"
   ],
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 416
    },
    "id": "JttFhOw-MSX9",
    "outputId": "a1279c79-3574-4c69-ec47-15cb1aa4e57f"
   },
   "execution_count": 19,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "source": [
    "### Create, run a 3-agent system to handle user queries\n"
   ],
   "metadata": {
    "id": "BZcvWNXDO4gt"
   }
  },
  {
   "cell_type": "code",
   "source": [
    "task = LanceRAGTaskCreator.new(agent, interactive=True)\n",
    "\n",
    "task.run(\"Can you help with some questions about these movies?\")"
   ],
   "metadata": {
    "id": "nVrqsGNFOyG4"
   },
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": [],
   "metadata": {
    "id": "xOTmfjXjPBn4"
   },
   "execution_count": null,
   "outputs": []
  }
 ]
}
